In [2]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.linear_model import LinearRegression
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from IPython.display import Image, display

Simple Linear Regression

This section uses a linear regression problem as a means to explain what neural networks are and how they are trained. We'll start with simple linear regression where we have just one predictor. First we'll generate 10 random numbers between 0 and 5, which we'll put into a 1D array called X.

Note: the code in this section is written entirely in Python because it is used far more than R for neural network training. You should be able to follow along without any Python knowledge.

In [3]:
np.random.seed(0)
X = np.array(np.random.uniform(0, 5, size=10), dtype='float32')
X
Out[3]:
array([2.7440674, 3.5759468, 3.0138168, 2.724416 , 2.118274 , 3.2294705,
       2.187936 , 4.458865 , 4.8183136, 1.9172076], dtype=float32)

A 1D array is equivalent to a vector in R. Next, we create a response variable based on the formula $y = 0.5(x + \epsilon) + 1$, where $\epsilon$ is normally distributed noise.

In [4]:
error = np.array(np.random.normal(loc=0.0, scale=0.5, size=10), dtype='float32')
y = 0.5*(X + error) + 1
y
Out[4]:
array([2.4080446, 3.1515417, 2.6971679, 2.3926268, 2.1701028, 2.698154 ,
       2.4674878, 3.178143 , 3.4874237, 1.7450799], dtype=float32)

Now we'll visualize our data with a scatter plot.

In [5]:
fig = go.Figure(
    data=go.Scatter(x=X, y=y, mode='markers'), 
    layout=go.Layout(title="Linear Data", width=500, height=500, template="plotly_white"))
fig.show()

Linear Regression With Scikit-learn Module

Scikit-learn's LinearRegression() is the equivalent of lm() in R; however, it expects the data to be in 2D arrays, so we'll add a dimension with .reshape().

In [6]:
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)

X
Out[6]:
array([[2.7440674],
       [3.5759468],
       [3.0138168],
       [2.724416 ],
       [2.118274 ],
       [3.2294705],
       [2.187936 ],
       [4.458865 ],
       [4.8183136],
       [1.9172076]], dtype=float32)

Now fit a linear model.

In [7]:
lm = LinearRegression(fit_intercept=True)

lm.fit(X, y)
Out[7]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

This should give us a slope close to 0.5 and a y-intercept close to 1.0 ("close" due to the error added to y earlier).

In [8]:
print("Slope", lm.coef_)
print("Y-intercept", lm.intercept_)
Slope [[0.50377744]]
Y-intercept [1.0885315]

Use .flatten() to convert the 2D array back into a 1D array for plotting.

In [9]:
y_pred = lm.predict(X)

fig = go.Figure()
fig.add_trace(go.Scatter(x=X.flatten(), y=y.flatten(), mode='markers', showlegend=False))
fig.add_trace(go.Scatter(x=X.flatten(), y=y_pred.flatten(), mode='lines', showlegend=False))
fig.update_layout(title="Linear Regression Model", width=500, height=500, template="plotly_white")
fig.show()

Linear Regression With A Neural Network

Neural networks use tensors instead of arrays, and what sets tensors apart is support for automatic differentiation. By creating a tensor with requires_grad=True, we can automatically compute the derivative of $\hat{y}$ with respect to the tensors w and b.

In [10]:
x = torch.tensor(3.)                        # same thing as np.array(3.)
w = torch.tensor(0.5, requires_grad=True)   # the weight (same thing as beta in linear regression)
b = torch.tensor(1., requires_grad=True)    # the y-intercept

We can perform mathematical operations as expected. Using the x, w, and b values specified above, we get $\hat{y} = wx + b = (0.5)(3) + 1 = 2.5$.

In [11]:
y_hat = w * x + b
y_hat
Out[11]:
tensor(2.5000, grad_fn=<AddBackward0>)

Since we specified requires_grad=True, we can retrieve partial derivatives (gradients) with .grad after calling the .backward() method on the response variable. For our example, we have:

$$\hat{y} = 3w + b$$

Therefore,

  • w.grad $= \frac{\partial \hat{y}}{\partial w} = 3$, and
  • b.grad $= \frac{\partial \hat{y}}{\partial b} = 1$.
In [12]:
y_hat.backward()
print('dy/dw:', w.grad)
print('dy/db:', b.grad)
dy/dw: tensor(3.)
dy/db: tensor(1.)

The neural network equivalent of LinearRegression() is torch.nn.Linear(). Below, we specify the number of predictor variables (the first 1) and the number of response variables (the second 1).

In [13]:
torch.manual_seed(0)

nn_model = nn.Linear(1, 1)
nn_model
Out[13]:
Linear(in_features=1, out_features=1, bias=True)

bias=True is the equivalent of fit_intercept=True in linear regression. When we define a neural network model, its parameters are initialized with random numbers. The first parameter is the weight (our w above, which represents the linear model coefficient); the second parameter is the bias (our b above, which represents the y-intercept).

In [14]:
for param in nn_model.parameters(): print(param.data)
tensor([[-0.0075]])
tensor([0.5364])

Since we have values for the model parameters, we can make predictions with the neural network model now. To do that, we pass our X values to our nn_model(). The neural network expects tensors as inputs, so first we convert the numpy arrays into tensors with torch.from_numpy().

In [15]:
inputs = torch.from_numpy(X)
targets = torch.from_numpy(y)
inputs
Out[15]:
tensor([[2.7441],
        [3.5759],
        [3.0138],
        [2.7244],
        [2.1183],
        [3.2295],
        [2.1879],
        [4.4589],
        [4.8183],
        [1.9172]])

Recall that the model parameters are random numbers, so the predictions will not be very accurate. That's OK, though; it's the starting point for every untrained neural network model. We'll go through the following iterative process to slowly train the model to make more and more accurate predictions.

The Training Process

  1. Make predictions for input values.
  2. Measure the difference between those predictions and the true values (called the loss).
  3. Compute the partial derivative (gradient) of the loss with respect to the model parameters.
  4. Update the model parameters using the partial derivative values computed in the previous step.
  5. Repeat this process until the loss is either unchanged or is sufficiently low.
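Condensed into code, the whole loop looks like the following sketch (using a fresh toy model and made-up data, not the objects defined above):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1)
toy_x = torch.rand(10, 1)                 # hypothetical predictor values
toy_y = 0.5 * toy_x + 1                   # hypothetical (noise-free) responses
toy_model = torch.nn.Linear(1, 1)
toy_opt = torch.optim.SGD(toy_model.parameters(), lr=0.01)

toy_losses = []
for _ in range(100):
    pred = toy_model(toy_x)               # 1. make predictions
    loss = F.mse_loss(pred, toy_y)        # 2. measure the loss
    loss.backward()                       # 3. compute the gradients
    toy_opt.step()                        # 4. update the parameters
    toy_opt.zero_grad()                   # reset gradients for the next pass
    toy_losses.append(loss.item())        # 5. repeat until the loss stops improving
```

Running this, the recorded losses shrink steadily; each piece of the loop is unpacked one step at a time below.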

The next several code chunks demonstrate this process one step at a time.

Step 1. Make predictions.

We can make predictions manually using the randomly initialized weight and bias.

In [16]:
params = [param.data for param in nn_model.parameters()]
w = params[0].clone().detach().requires_grad_(True)
b = params[1].clone().detach().requires_grad_(True)

print(w, b)
tensor([[-0.0075]], requires_grad=True) tensor([0.5364], requires_grad=True)
In [17]:
y_hat_manual = inputs * w + b   # make a prediction
y_hat_manual
Out[17]:
tensor([[0.5159],
        [0.5097],
        [0.5139],
        [0.5160],
        [0.5206],
        [0.5123],
        [0.5201],
        [0.5031],
        [0.5004],
        [0.5221]], grad_fn=<AddBackward0>)

If we pass the inputs to the nn_model() object, we get the same predictions.

In [18]:
y_hat = nn_model(inputs)
y_hat
Out[18]:
tensor([[0.5159],
        [0.5097],
        [0.5139],
        [0.5160],
        [0.5206],
        [0.5123],
        [0.5201],
        [0.5031],
        [0.5004],
        [0.5221]], grad_fn=<AddmmBackward>)

Step 2. Calculate the loss.

There are a number of ways we could do this, but for this example, we'll calculate the loss by determining the mean squared error of the predictions and the target values. Mean squared error is defined as:

$$MSE = \frac{1}{n} \sum\limits_{i=1}^{n}{(y_i - \hat{y}_i)^2}$$

We can calculate this explicitly:

In [19]:
def MSE(observed, predicted):
    return 1/len(observed) * torch.sum((observed - predicted)**2)

loss_manual = MSE(targets, y_hat_manual)
loss_manual
Out[19]:
tensor(4.7714, grad_fn=<MulBackward0>)

PyTorch has a built-in function, F.mse_loss, that does the same thing.

In [20]:
loss = F.mse_loss(y_hat, targets)
loss
Out[20]:
tensor(4.7714, grad_fn=<MseLossBackward>)

Step 3. Compute the partial derivative of the loss.

Recall we did this earlier for a single observation using y_hat.backward() and then calling w.grad and b.grad.

In [21]:
loss_manual.backward()  # compute the gradients

print("Weight gradient", w.grad)
print("Bias gradient", b.grad)
Weight gradient tensor([[-13.9623]])
Bias gradient tensor([-4.2524])

A visualization will help to show what's going on. What we're saying is that the loss (or prediction error) is a function of the weight and bias, which is shown below.

In [27]:
ws = np.linspace(-0.5, 1.5, 21)
bs = np.linspace(0, 2, 21)
w_varies = np.linspace(-0.5075, 1.4925, 41)
b_varies = np.linspace(0.0364, 2.0364, 41)

fig = go.Figure(data =
    [go.Surface(x= ws, y= bs,
                z=[[MSE(targets, inputs*ws[j]+bs[i]).item() for j in range(21)] for i in range(21)],
                contours = {"z": {"show": True, "start": 0.0, "end": 1.0, "size": 0.1, "color":"white"}},
                colorbar=dict(len=0.5, lenmode='fraction')),
     go.Scatter3d(x=w_varies, y=[0.5364 for j in range(41)], z=[MSE(targets, inputs*w_varies[i] + 0.5364).item() for i in range(41)], 
                  mode='lines', showlegend=False, line=dict(width=20, color='blue')),
    go.Scatter3d(x=[-0.0075 for j in range(41)], y=b_varies, z=[MSE(targets, inputs*-0.0075 + b_varies[i]).item() for i in range(41)], 
                 mode='lines', showlegend=False, line=dict(width=20, color='blue')),
    go.Scatter3d(x=[-0.0075], y=[0.5364], z=[4.7714], 
                 mode='markers', showlegend=False, marker=dict(size=20, color='green'))])
fig.update_layout(scene = dict(xaxis_title='Weight', yaxis_title='Bias', zaxis_title='Loss'),
                  title="Loss Function", width=800, height=800, xaxis_title="Weight", yaxis_title="Bias", template="plotly_white")
fig.show()

Now we'll consider one cross section at a time starting with weight. Below, the blue line is the loss function, and the red line is the weight gradient when weight = -0.0075. What can we observe from this plot?

  • A negative gradient means we need to increase the weight to decrease the loss.
  • A positive gradient means we need to decrease the weight to decrease the loss.
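These two observations can be verified with a toy loss function (a made-up example, unrelated to the model above), whose minimum sits at $w = 0$:

```python
import torch

# start to the LEFT of the minimum of the toy loss w^2
w = torch.tensor(-2.0, requires_grad=True)
loss = w ** 2
loss.backward()
print(w.grad)   # tensor(-4.) : negative gradient, so increasing w lowers the loss
```

Starting to the right of the minimum (say w = 2) would give a positive gradient of 4, telling us to decrease w instead.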
In [55]:
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=("Weight Cross Section", "Bias Cross Section"))
fig.add_trace(go.Scatter(x=w_varies, y=[MSE(targets, inputs*w_varies[i] + 0.5364).item() for i in range(41)], 
                         mode='lines', name='Loss Function', line=dict(color="blue")), row=1, col=1)
fig.add_trace(go.Scatter(x=[0.4925, -0.5075], y=[4.7714-13.9623/2, 4.7714+13.9623/2], 
                         mode='lines', name='Gradient', line=dict(color="red")), row=1, col=1)
fig.add_trace(go.Scatter(x=[-0.0075], y=[4.7714], mode="markers+text", text=["Gradient = -13.96"], textposition="top right", 
                         marker=dict(size=15, color='green'), showlegend=False), row=1, col=1)
fig.add_trace(go.Scatter(x=b_varies, y=[MSE(targets, inputs*-0.0075 + b_varies[i]).item() for i in range(41)], 
                         mode='lines', name='Loss Function', line=dict(color="blue"), showlegend=False), row=1, col=2)
fig.add_trace(go.Scatter(x=[0.0364, 1.0364], y=[4.7714+4.2524/2, 4.7714-4.2524/2], 
                         mode='lines', name='Gradient', line=dict(color="red"), showlegend=False), row=1, col=2)
fig.add_trace(go.Scatter(x=[0.5364], y=[4.7714], mode="markers+text", text=["Gradient = -4.25"], textposition="top right",
                         marker=dict(size=15, color='green'), showlegend=False), row=1, col=2)
fig.update_xaxes(title_text="Weight", row=1, col=1)
fig.update_xaxes(title_text="Bias", row=1, col=2)
fig.update_yaxes(title_text="Loss", row=1, col=1)
fig.update_layout(template="plotly_white")
fig.show()

We'll revisit this visualization in the next step when we determine how to update the model parameters. For now, we'll also calculate the gradients for the neural network model by calling the .backward() method on the loss.

In [28]:
loss.backward()

Step 4. Update the model parameters.

Now we need a method for updating model parameters that results in a lower loss (i.e., less error). From the plots above, we know we need to increase the weight and the bias to decrease the loss, but by how much? Right now, all we have to go on are the magnitudes of the gradients; keep in mind the algorithm doesn't yet know the shape of the loss function. It appears that the new weight and bias should be updated as follows:

  • $\omega_{new} = (\omega_{old}) - (\omega_{gradient})(\alpha)$
  • $\beta_{new} = (\beta_{old}) - (\beta_{gradient})(\alpha)$

Where $\alpha$ is a multiplier in the range [0, 1], and is referred to as the learning rate.
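As a quick arithmetic check of the update rule, we can plug in the gradient values printed earlier (rounded to four decimals, so the last digit may differ slightly from results computed with the unrounded tensors):

```python
# starting parameter values and their gradients, as printed above (rounded)
w_old, b_old = -0.0075, 0.5364
w_grad, b_grad = -13.9623, -4.2524

for alpha in (1.0, 0.1, 0.01):
    w_new = w_old - w_grad * alpha
    b_new = b_old - b_grad * alpha
    print(f"alpha={alpha}: w_new={w_new:.4f}, b_new={b_new:.4f}")
```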

Consider different values of $\alpha$.

| $\alpha$ | $\omega_{new}$ | Result | $\beta_{new}$ | Result | Next Loss Value |
| --- | --- | --- | --- | --- | --- |
| 1.0 | 13.9548 | Very far beyond 0.5, next gradient much larger | 4.7904 | Very far beyond 1.0, next gradient even larger | Very large increase |
| 0.1 | 1.3887 | Somewhat beyond 0.5, next gradient somewhat larger | 0.9618 | Near 1.0, next gradient smaller | Increase to ~5.8 |
| 0.01 | 0.1321 | Less than 0.5, next gradient smaller | 0.5790 | Less than 1.0, next gradient smaller | Decrease to ~3.2 |

Of the three $\alpha$'s considered, only $\alpha = 0.01$ will result in a lower loss value on the next iteration. The following figure highlights the difference between $\alpha=0.1$ and $\alpha=0.01$.

In [83]:
w1 = np.array([1.3887, 0])
b1 = np.array([0, 0.9618])
z1 = w1 + b1
w2 = np.array([0.1321, 0])
b2 = np.array([0, 0.5790])
z2 = w2 + b2

fig = go.Figure(data =
                [go.Contour(x= ws, y= bs, z=[[MSE(targets, inputs*ws[j]+bs[i]).item() for j in range(21)] for i in range(21)], 
                            colorbar=dict(len=0.5, lenmode='fraction')),
                 go.Scatter(x=[-0.0075, z1[0]], y=[0.5364, z1[1]], mode="lines", line=dict(color="red"), name="$\\alpha=0.1$"),
                 go.Scatter(x=[-0.0075, z2[0]], y=[0.5364, z2[1]], mode="lines", line=dict(color="yellow"), name="$\\alpha=0.01$"),
                 go.Scatter(x=[-0.0075], y=[0.5364], mode="markers", marker=dict(size=10, color="green"), name="Initial Loss"),
                 go.Scatter(x=[0.5], y=[1.0], mode="markers", marker=dict(size=10, color="cyan"), name="Approx. Target")])
fig.update_layout(scene = dict(xaxis_title='Weight', yaxis_title='Bias'),
                  title="Gradient Step", width=600, height=500, xaxis_title="Weight", yaxis_title="Bias", template="plotly_white")
fig.show()

Based on our choice of $\alpha=0.01$, we can now update the weight and bias accordingly.

In [47]:
with torch.no_grad():
    w -= w.grad * 1e-2
    b -= b.grad * 1e-2
    
print("Weight", w)
print("Bias", b)
Weight tensor([[0.1321]], requires_grad=True)
Bias tensor([0.5790], requires_grad=True)

Iterating through this process with a sufficiently small $\alpha$ allows the algorithm to slowly converge on the weight and bias values that minimize the loss. This process is known as gradient descent, and with the neural network model, we can use torch.optim.SGD() to do the calculations for us. Notice we get the same weight and bias values as we did with the manual method.

In [84]:
# Define optimizer
opt = torch.optim.SGD(nn_model.parameters(), lr=1e-2)

# update parameters
opt.step()

for param in nn_model.parameters(): print(param.data)
tensor([[0.1321]])
tensor([0.5790])

Be aware that before we iterate through this process again with the neural network model, we need to re-zero the gradients. If we don't, the next gradient value will be added to the current gradient value. We do that with opt.zero_grad().

In [85]:
opt.zero_grad()
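The accumulation behavior is easy to see with a standalone tensor (a toy example, unrelated to the model above):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
(3 * x).backward()
print(x.grad)        # tensor(3.)
(3 * x).backward()   # without re-zeroing, the new gradient is ADDED to the old one
print(x.grad)        # tensor(6.)
x.grad.zero_()       # what opt.zero_grad() does for every model parameter
print(x.grad)        # tensor(0.)
```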

Since we changed the model parameters, if we were to recalculate the loss right now, it should be slightly lower.

In [86]:
y_hat = nn_model(inputs)             # make predictions
loss = F.mse_loss(y_hat, targets)    # measure the loss
loss
Out[86]:
tensor(2.8808, grad_fn=<MseLossBackward>)

We then repeat this process until the loss is sufficiently reduced. Each training iteration is commonly referred to as an epoch. I'll start the training process over and keep track of the weight, bias, and loss as training progresses.

In [87]:
weights = []
biases = []
losses = []

torch.manual_seed(0)
nn_model = nn.Linear(1, 1)
opt = torch.optim.SGD(nn_model.parameters(), lr=1e-2)

params = [param.data for param in nn_model.parameters()]
weights.append(params[0].item())
biases.append(params[1].item())

for epoch in range(1000):
    pred = nn_model(inputs)           # make predictions
    loss = F.mse_loss(pred, targets)  # measure the loss as mean squared error
    loss.backward()                   # compute the partial derivative of the loss
    opt.step()                        # update parameters
    params = [param.data for param in nn_model.parameters()]
    weights.append(params[0].item())
    biases.append(params[1].item())
    losses.append(loss.item())
    opt.zero_grad()                   # re-zero the gradients
In [88]:
Epoch = [i for i in range(1000)]

fig = go.Figure()
fig.add_trace(go.Scatter(x=Epoch, y=weights, mode='lines', name='Weight'))
fig.add_shape(type="line", x0=0, y0=0.5, x1=1000, y1=0.5, line=dict(color="black", width=1, dash="dashdot"))
fig.add_trace(go.Scatter(x=Epoch, y=biases, mode='lines', name='Bias'))
fig.add_shape(type="line", x0=0, y0=1, x1=1000, y1=1, line=dict(color="black", width=1, dash="dashdot"))
fig.add_trace(go.Scatter(x=Epoch, y=losses, mode='lines', name='Loss'))
fig.add_trace(go.Scatter(x=[800, 800], y=[0.35, 1.15],
    text=["Weight Optimal Value", "Bias Optimal Value"],
    mode="text", showlegend=False))
fig.update_layout(title="Model Parameters and Loss", xaxis_title="Epoch", width=800, height=500, template="plotly_white")
fig.show()
In [93]:
fig = go.Figure(data = [go.Contour(x=ws, y=bs, z=[[MSE(targets, inputs*ws[j]+bs[i]).item() for j in range(21)] for i in range(21)],
                                  colorbar=dict(len=0.5, lenmode='fraction')),
                       go.Scatter(x=weights[0:1000], y=biases[0:1000], mode='markers', marker=dict(size=5, color='yellow'))])
fig.update_layout(scene = dict(xaxis_title='Weight', yaxis_title='Bias', zaxis_title='Loss'),
                  title="Gradient Descent", width=500, height=500, xaxis_title="Weight", yaxis_title="Bias", template="plotly_white")
fig.show()

After 1000 epochs, the model parameters are very near the linear model values.

In [94]:
print("Linear model slope", lm.coef_)
print("Linear model intercept", lm.intercept_)

print("\nNeural network results")
for param in nn_model.parameters(): print(param.data)
Linear model slope [[0.50377744]]
Linear model intercept [1.0885315]

Neural network results
tensor([[0.5280]])
tensor([1.0080])

Plotting the linear regression lines from both models, we get:

In [30]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=X.flatten(), y=y.flatten(), mode='markers', showlegend=False))
fig.add_trace(go.Scatter(x=X.flatten(), y=y_pred.flatten(), mode='lines', name="Linear Model"))
fig.add_trace(go.Scatter(x=X.flatten(), y=pred.flatten().detach().numpy(), mode='lines', name="NN Model"))
fig.update_layout(title="Linear Regression vs. NN Regression", width=500, height=500, template="plotly_white")
fig.show()

Visualizing the Network

We have so far created a very simple neural network, which we can visualize in the following way.

In [31]:
display(Image(filename='C:\\Users\\jking\\OneDrive\\Documents\\ML\\viz\\simple_nn.png', width=500))

The blue circles are referred to as nodes of the input layer, and the green circle is the output layer, which in this case consists of just one node. Both input nodes are connected to the output node and have associated weights ($\omega$). The output node produces an output value by applying an activation function to the weighted sum of its inputs. In the case of linear regression, we use a linear activation function, which simply returns its input unchanged: $\hat{y} = f(\sum\limits_{h}{\omega_h x_h})$ with $f(u) = u$.
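As a quick numeric check (with hypothetical weights chosen to match the earlier single-observation example, treating the bias as a weight on a constant input of 1):

```python
# linear activation: f(u) = u applied to the weighted sum of the inputs
def linear_activation(weights, inputs):
    u = sum(w * x for w, x in zip(weights, inputs))
    return u    # f(u) = u, so the output is just the weighted sum

# hypothetical values: w = 0.5 on x = 3, plus a bias of 1 on a constant input of 1
print(linear_activation([0.5, 1.0], [3.0, 1.0]))   # 2.5
```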

Multiple Linear Regression

This framework can easily be expanded to multiple linear regression where we have two or more predictor variables. For two predictors, our linear regression equation would become $y = \omega_0 + \omega_1x_1 + \omega_2x_2$. The above visualization would then change by adding another blue input node for $x_2$ and connecting it to the output node with $\omega_2$. Otherwise, the process described above for training the neural network model remains unchanged.
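For example, a two-predictor version of the network (a sketch with made-up input data) only requires changing the first argument of torch.nn.Linear():

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
two_pred_model = nn.Linear(2, 1)    # two predictors, one response
X2 = torch.rand(8, 2)               # hypothetical data: 8 observations, 2 predictors
print(two_pred_model(X2).shape)     # torch.Size([8, 1])

# three trainable parameters: omega_1, omega_2, and the bias (omega_0)
print(sum(p.numel() for p in two_pred_model.parameters()))   # 3
```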

Overkill?

If you think this seems like overkill just to model a linear relationship between two variables, I'd agree with you. But consider this:

  • What if the relationship between two variables isn't linear?
  • What if there are dozens of predictor variables and dozens of response variables and the underlying relationships are highly complex?

In cases like these, neural network models can be very beneficial. To make that leap, however, we need to give our neural network model more power by giving it the ability to model these complexities. We do that by adding one or more hidden layers to the model.

The Hidden Layer

Neural network models become universal function approximators with the addition of one or more hidden layers. Hidden layers fall between the input layer and the output layer. Adding more predictor variables and one hidden layer, we get the following network.

In [32]:
display(Image(filename='C:\\Users\\jking\\OneDrive\\Documents\\ML\\viz\\hidden_nn.png', width=800))

We've introduced a new variable, $\nu$, and a new function, $u$. The $\nu$ variables are trainable weights just like the $\omega$ variables. The $u$ functions are activation functions as described earlier. Typically, all nodes in a hidden layer share a common type of activation function. A variety of activation functions have been developed, a few of which are shown below. For many applications, a rectified linear activation function is a good choice for hidden layers.

| Activation Function | Formula | Output Type |
| --- | --- | --- |
| Threshold | $f(u) = \begin{cases} 1, & u>0 \\ 0, & u\le0 \end{cases}$ | Binary |
| Linear | $f(u) = u$ | Numeric |
| Logistic | $f(u) = \frac{e^u}{1+e^u}$ | Numeric between 0 and 1 |
| Rectified Linear | $f(u) = \begin{cases} u, & u>0 \\ 0, & u\le0 \end{cases}$ | Numeric, non-negative |
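The formulas in the table can be written directly as tensor operations; the last two match PyTorch's built-in torch.sigmoid and torch.relu:

```python
import torch

u = torch.tensor([-1.0, 0.0, 2.0])

threshold = (u > 0).float()                        # binary: 0 or 1
linear = u                                         # unchanged
logistic = torch.exp(u) / (1 + torch.exp(u))       # squashed into (0, 1)
relu = torch.where(u > 0, u, torch.zeros_like(u))  # negatives clipped to 0

print(threshold)   # tensor([0., 0., 1.])
print(relu)        # tensor([0., 0., 2.])
```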

As stated, adding a hidden layer turns the neural network model into a universal function approximator. For the case of regression, we can think of this as giving the neural network the ability to model non-linear functions without knowing what the nature of the nonlinear relationship is. Compare that to linear regression. If we had the following polynomial relationship $y = x^2$, we would need to transform either the response or predictor variable in a linear regression model, and this requires knowing the order of the polynomial to get a good fit. With neural network regression, we don't need this knowledge. Instead, we allow the hidden layer to learn the nature of the relationship through the model training process. To demonstrate, we'll generate some data for $y = x^2 + \varepsilon$, where $\varepsilon$ represents some amount of error.

In [62]:
torch.manual_seed(0)

X = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)
y = X.pow(2) + 0.2*torch.rand(X.size())   

fig = go.Figure()
fig.add_trace(go.Scatter(x=X.flatten().detach().numpy(), y=y.flatten().detach().numpy(), mode='markers', showlegend=False))
fig.update_layout(title="Polynomial Response", width=800, height=500, template="plotly_white")
fig.show()

Add A Hidden Layer To The Model

Adding a hidden layer to a PyTorch neural network may be done as follows. Notice that bias nodes are added automatically.

In [63]:
nn_model2 = torch.nn.Sequential(
    torch.nn.Linear(1, 10),       # connect 1 input node to 10 nodes in the hidden layer
    torch.nn.ReLU(),              # use a rectified linear activation function
    torch.nn.Linear(10, 1),       # connect the 10 hidden layer nodes to the 1 output node
)

nn_model2
Out[63]:
Sequential(
  (0): Linear(in_features=1, out_features=10, bias=True)
  (1): ReLU()
  (2): Linear(in_features=10, out_features=1, bias=True)
)

That's all there is to it! We now train the model exactly as we did before.

In [64]:
opt = torch.optim.SGD(nn_model2.parameters(), lr=0.05)

for epoch in range(500):
    pred = nn_model2(X)         # make predictions
    loss = F.mse_loss(pred, y)  # measure the loss
    loss.backward()             # calculate the gradients
    opt.step()                  # update the model parameters
    opt.zero_grad()             # re-zero the gradients
    if epoch % 50 == 0: print('Training loss:', loss.item())

Let's look at the fit.

In [65]:
newx = torch.from_numpy(np.arange(-1, 1.1, 0.1, dtype='float32').reshape(-1,1))
newpreds = nn_model2(newx)

fig = go.Figure()
fig.add_trace(go.Scatter(x=X.flatten().detach().numpy(), y=y.flatten().detach().numpy(), mode='markers', showlegend=False))
fig.add_trace(go.Scatter(x=newx.flatten().detach().numpy(), y=newpreds.flatten().detach().numpy(), mode='lines', showlegend=False))
fig.update_layout(title="Neural Network With Hidden Layer", width=800, height=500, template="plotly_white")
fig.show()

Let's look at the model parameters. We should have a total of 31 trainable parameters: 10 from the input node to the hidden layer, 10 from the bias node to the hidden layer, 10 from the hidden layer to the output node, and one from the bias node in the hidden layer to the output node.
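The count can be confirmed by rebuilding the same architecture and summing the parameter tensor sizes (a standalone check, independent of the trained model above):

```python
import torch.nn as nn

check_model = nn.Sequential(nn.Linear(1, 10), nn.ReLU(), nn.Linear(10, 1))
for name, p in check_model.named_parameters():
    print(name, p.numel())   # four parameter tensors of sizes 10, 10, 10, and 1
print(sum(p.numel() for p in check_model.parameters()))   # 31
```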

In [59]:
for name, param in nn_model2.named_parameters():
    if param.requires_grad:
        print(name, param.data)
0.weight tensor([[ 1.2948],
        [ 0.7755],
        [-0.4360],
        [-0.2824],
        [-0.8643],
        [-0.0180],
        [-0.9445],
        [-0.9520],
        [-0.0434],
        [ 0.1807]])
0.bias tensor([-0.2901,  0.4698, -0.6085,  0.8950,  0.7229, -0.8433, -0.3125, -0.1428,
         0.1427,  0.2347])
2.weight tensor([[ 0.8514,  0.1464, -0.1543,  0.1889, -0.0057, -0.1874,  0.6076,  0.5984,
         -0.1020, -0.2749]])
2.bias tensor([-0.0597])

Let's do that again, only this time, we'll keep track of the predictions as we train the model so that we can plot them over time.

In [66]:
torch.manual_seed(0)

nn_model2 = torch.nn.Sequential(
    torch.nn.Linear(1, 10),       # connect 1 input node to 10 nodes in the hidden layer
    torch.nn.ReLU(),              # use a rectified linear activation function
    torch.nn.Linear(10, 1),       # connect the 10 hidden layer nodes to the 1 output node
)

opt = torch.optim.SGD(nn_model2.parameters(), lr=0.05)

for epoch in range(500):
    pred = nn_model2(X)
    if epoch % 10 == 0: 
        newpred = nn_model2(newx)
        if epoch == 0:
            df = pd.DataFrame(data = {'x': newx.flatten().detach().numpy(), 
                                      'y': newpred.flatten().detach().numpy(), 
                                      'frame': [epoch/10 for i in range(21)]})
        else:
            newdf = pd.DataFrame(data = {'x': newx.flatten().detach().numpy(), 
                                         'y': newpred.flatten().detach().numpy(), 
                                         'frame': [epoch/10 for i in range(21)]})
            df = pd.concat([df, newdf])   
    loss = F.mse_loss(pred, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
In [67]:
fig = go.Figure(
    data=[go.Scatter(x=X.flatten().detach().numpy(), y=y.flatten().detach().numpy(),
                     mode="markers", name='Data',
                     line=dict(width=2, color="blue")),
          go.Scatter(x=df.loc[df['frame']==0, 'x'], y=df.loc[df['frame']==0, 'y'],
                     mode="lines",
                     name="Fit",
                     line=dict(color="red", width=2))
         ],
    layout=go.Layout(
        xaxis=dict(range=[df['x'].min(), df['x'].max()], autorange=False, zeroline=False),
        yaxis=dict(range=[df['y'].min(), df['y'].max()], autorange=False, zeroline=False),
        title_text="NN Regression Fit During Training", hovermode="closest", 
        template="plotly_white", width=800, height=500,
        updatemenus=[dict(type="buttons",
                          buttons=[dict(label="Play",
                                        method="animate",
                                        args=[None])])]),
    frames=[go.Frame(
        data=[go.Scatter(x=X.flatten().detach().numpy(), y=y.flatten().detach().numpy(),
                         mode="markers", name='Data',
                         line=dict(width=2, color="blue")),
              go.Scatter(x=df.loc[df['frame']==k, 'x'], y=df.loc[df['frame']==k, 'y'],
                         mode="lines", name="Fit",
                         line=dict(color="red", width=2))
              ])
            for k in range(50)]
)

fig.show()